Configure browsertrix proxies #1847

Merged
merged 58 commits into from
Oct 3, 2024

Conversation

vnznznz
Contributor

@vnznznz vnznznz commented Jun 4, 2024

Resolves #1354

Current state of my socks proxy configuration work.

Done:

  • add Dockerfile for browsertrix-proxy, currently just a container for openssh-client, but it could be used to facilitate more proxy connections in the future
  • add crawler_socks_proxy_servers helm config setting to set a list of ssh servers that can be used to create a socks proxy connection
  • add backend + frontend support to set crawler proxy server for a workflow and run crawl through proxy
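
For illustration, a minimal sketch of how a SOCKS proxy can be opened over SSH with openssh-client, which is what the browsertrix-proxy container ships. The function name, the default SOCKS port 1080, and the example host are assumptions made for this sketch and are not taken from the PR; only the `ssh` flags `-N`, `-D`, `-p`, and `-i` are standard OpenSSH options.

```python
from typing import List, Optional


def build_socks_ssh_command(
    user: str,
    host: str,
    ssh_port: int = 22,
    socks_port: int = 1080,  # assumption: local port for the SOCKS listener
    identity_file: Optional[str] = None,
) -> List[str]:
    """Build an argv list for an SSH dynamic (SOCKS) port forward.

    Hypothetical helper, not the PR's implementation.
    """
    cmd = [
        "ssh",
        "-N",  # do not run a remote command, only forward
        "-D", f"localhost:{socks_port}",  # dynamic application-level forwarding (SOCKS)
        "-p", str(ssh_port),  # port of the remote SSH server
    ]
    if identity_file:
        cmd += ["-i", identity_file]  # private key used for authentication
    cmd.append(f"{user}@{host}")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_socks_ssh_command("crawler", "proxy.example.org")))
```

A crawler pointed at `socks5://localhost:1080` would then send its traffic through the SSH server, assuming the tunnel is up and the server permits forwarding.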

Todos

  • improve naming: currently it's crawler_socks_proxy_server or crawlerSocksProxyServer, which can mean either the id of the server, the full object, or a subset without the sensitive data.
  • handle workflow updates: once set, a socks proxy server can't be changed for a workflow at the moment
  • figure out a way to do CI tests
  • document the feature
    • add setup information for ssh server
  • review data model usage, do I have to store the socks proxy server settings in more places, is it okay to just store the id?
  • review pod lifecycle, do I configure the crawlconfig, params, etc in the correct place?
  • error handling, what happens when a socks proxy server leaves the config?
  • fix socks proxy handling in browsertrix-crawler, which broke with the removal of pywb; see browsertrix-crawler#589 ("proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars")
  • add socks proxy selection to profilebrowser
  • add socks proxy selection when updating browser profile
  • add proxy selection to org defaults

Needed for local testing:

  • Browsertrix Crawler 1.3.0+ (released)

@Shrinks99 Shrinks99 changed the title from "WIP: Configure socks proxies, resolves #1354" to "WIP: Configure socks proxies" Jun 5, 2024
vnznznz and others added 3 commits July 30, 2024 17:08
… just use proxies

just pass proxyId to CrawlJob, lookup in operator
share single secret for all proxy configs, containing private keys and public host keys
use single 'auth' field for 'user@host[:port]'
secrets: map /etc/passwd and /etc/group to ensure user/group are defined for ssh
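
As an illustration of the single 'auth' field format 'user@host[:port]' mentioned in the commit message above, here is a minimal parsing sketch. The names `ProxyAuth` and `parse_proxy_auth` are hypothetical, and the default port of 22 is an assumption (the standard SSH port); the PR may parse this field differently.

```python
from typing import NamedTuple


class ProxyAuth(NamedTuple):
    user: str
    host: str
    port: int


def parse_proxy_auth(auth: str, default_port: int = 22) -> ProxyAuth:
    """Split 'user@host[:port]' into its parts.

    Hypothetical helper; bracketed IPv6 hosts are not handled in this sketch.
    """
    user, at, hostport = auth.partition("@")
    if not at or not user or not hostport:
        raise ValueError(f"expected 'user@host[:port]', got {auth!r}")

    host, colon, port_str = hostport.rpartition(":")
    if not colon:
        # No ':' present, so the whole remainder is the host.
        return ProxyAuth(user, hostport, default_port)
    if not host:
        raise ValueError(f"missing host in {auth!r}")
    # int() raises ValueError for an empty or non-numeric port.
    return ProxyAuth(user, host, int(port_str))


if __name__ == "__main__":
    print(parse_proxy_auth("crawler@proxy.example.org"))
    print(parse_proxy_auth("crawler@proxy.example.org:2222"))
```

Keeping the value in one string keeps the helm config compact, at the cost of needing a small parser like this wherever the user, host, and port are required separately.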
@ikreymer ikreymer marked this pull request as draft July 31, 2024 05:05
@ikreymer ikreymer requested a review from tw4l July 31, 2024 05:05
docs/deploy/proxies.md (outdated, resolved)
docs/deploy/proxies.md (outdated, resolved)
docs/user-guide/workflow-setup.md (outdated, resolved)
@tw4l tw4l marked this pull request as ready for review October 2, 2024 19:58
@tw4l
Contributor

tw4l commented Oct 2, 2024

@ikreymer Proxy documentation has been updated based on @vnznznz's comments whenever you're ready to take a look.

@tw4l tw4l changed the title from "WIP: Configure browsertrix proxies" to "Configure browsertrix proxies" Oct 2, 2024
@ikreymer ikreymer merged commit bb6e703 into main Oct 3, 2024
5 checks passed
@ikreymer ikreymer deleted the configure-socks-proxies branch October 3, 2024 01:35
Development

Successfully merging this pull request may close these issues.

[Feature]: Support crawling through pre-configured SOCKS5 proxies
3 participants